Abstract

We present an unusual algorithm involving classification trees where two trees are grown in opposite directions
so that they are matched at their leaves. This approach finds application in a new data mining task we formulate,
called redescription mining. A redescription is a shift-of-vocabulary, or a different way of communicating
information about a given subset of data; the goal of redescription mining is to find subsets of data that afford multiple
descriptions. We highlight the importance of this problem in domains such as bioinformatics, which exhibit an
underlying richness and diversity of data descriptors (e.g., genes can be studied in a variety of ways). Our
approach helps integrate multiple forms of characterizing datasets, situates the knowledge gained from one dataset
in the context of others, and harnesses high-level abstractions for uncovering cryptic and subtle features of data.
Algorithm design decisions, implementation details, and experimental results are presented.

1 Introduction

Classification and regression trees (CART) were among the earliest proposed approaches for pattern classification and
data mining [4]. While being powerful in terms of accuracy and efficiency of induction, their results are also simple
to understand as they mimic the decision-making logic of human experts. The renewed emphasis on data mining
propagated by the knowledge discovery in databases (KDD) community in the early 1990s has fueled a resurgence
of interest in tree-based methods. Researchers have revisited tree induction algorithms in the context of datasets
residing in secondary storage ISl fTOll, creating scalable and highly efficient implementations [3]. The many fielded
applications of tree-based methods range from everyday uses such as spam filtering fT^ to astrophysical domains
such as classifying galaxies fT4ll.

In this paper we introduce a new data mining task — redescription mining — and also propose a novel tree-based 
algorithm (CARTwheels) for mining redescriptions. A redescription is a shift-of-vocabulary, or a different way of 
communicating information about a given subset of data; the goal of redescription mining is to find subsets of data 
that afford multiple descriptions. 

Consider the set of all countries in the world. The elements of this set can be described in various ways, e.g., 
geographical location, political status, scientific capabilities, and economic prosperity. Such descriptors allow us 
to define various subsets of the given (universal) set. A redescription involves a subset definable in two ways, for 
instance: 

'Countries with > 200 Nobel prize winners' ⇔ 'Countries with > 150 billionaires'



This redescription involves two descriptors, and says that the countries with more than 200 Nobel prize winners are
also those countries with more than 150 billionaires. One country satisfies both descriptors, namely U.S.A., and we
say that it has been redescribed. The strength of the redescription is given by the symmetric Jaccard's coefficient,
which is the ratio of the size of the intersection of two descriptors to the size of their union (in this case, 1/1 = 1).
Descriptors on either side of a redescription can involve more than one entity, e.g., 

'Countries with defense budget > $30 billion' ⇔ 'Permanent members of U.N. Security Council'

This redescription is only approximate, however, since the left descriptor contains {U.S.A., U.K., Japan, France, 
Germany} and the right descriptor represents {U.S.A., U.K., Russia, France, China}. The Jaccard's coefficient is 
hence 3/7 ≈ 0.429.

To strengthen redescriptions such as above, we can use more selective descriptors: 

'Countries with declared nuclear arsenals' ⇔ 'Permanent members of U.N. Security Council'

which improves the Jaccard's coefficient to 5/8 = 0.625 since the left descriptor now represents {U.S.A., U.K., Russia, 
France, China, India, Israel, Pakistan}. Another approach to strengthening is to form set-theoretic operations (union, 
intersection, difference) involving the given descriptors; e.g. the redescription 

'Countries with defense budget > $30 billion' ∩ 'Countries with declared nuclear arsenals' ⇔
'Permanent members of U.N. Security Council' − 'Countries with history of communism'

holds with Jaccard's coefficient 1. It refers to three countries: {U.S.A., U.K., France}. 
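
The coefficients quoted in these examples can be checked with a few lines of Python. This is a minimal sketch: the helper name `jaccard` is ours, and the 'history of communism' descriptor is assumed to contain only Russia and China among the countries named above.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Descriptor contents as listed in the text.
defense = {"U.S.A.", "U.K.", "Japan", "France", "Germany"}
permanent = {"U.S.A.", "U.K.", "Russia", "France", "China"}
nuclear = {"U.S.A.", "U.K.", "Russia", "France", "China",
           "India", "Israel", "Pakistan"}
# Assumption: restricted to the countries above, only these two qualify.
communism = {"Russia", "China"}

approx = jaccard(defense, permanent)                           # 3/7
stronger = jaccard(nuclear, permanent)                         # 5/8
exact = jaccard(defense & nuclear, permanent - communism)      # 3/3
print(approx, stronger, exact)
```

The last call shows how set-theoretic combinations on both sides can turn two approximate redescriptions into an exact one.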

The inputs to redescription mining are the universal set of objects O and two sets (X and Y) of subsets of O. The
elements of X are the descriptors X_i, and are assumed to form a covering of O (∪_i X_i = O). Similarly, ∪_j Y_j = O.
The only requirement of a descriptor is that it be a proper subset of O and denote some logical grouping of the
underlying objects (for ease of interpretation). The goal of redescription mining is to find equivalence relationships
of the form E ⇔ F that hold at or above a given Jaccard's coefficient θ (i.e., ≥ θ), where E and F are
set-theoretic expressions involving X_i's and Y_j's, respectively. For tractability purposes, some syntactic bias on the
allowable set-theoretic expressions or their length is assumed to be provided. For instance, we might restrict E to only 
involve intersections of two descriptors from X and F to either an intersection or difference of two descriptors from 
Y. Redescription mining hence involves constructive induction (the task of inventing new features) and exhibits traits 
of both unsupervised and supervised learning. It is unsupervised because it finds conceptual clusters underlying data, 
and it can be viewed as supervised because clusters defined using descriptors are given meaningful characterizations 
(in terms of other descriptors). 
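
To make the role of the syntactic bias concrete, the following sketch enumerates candidates exhaustively under exactly the bias mentioned above (E an intersection of two descriptors from X, F an intersection or difference of two descriptors from Y). It is a brute-force illustration and not the CARTwheels algorithm; the function name `mine_brute_force`, the descriptor names, and the contents of the 'communism' descriptor are assumptions made for the example.

```python
from itertools import combinations, permutations

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mine_brute_force(X, Y, theta):
    """Return (E, F, coefficient) triples with Jaccard's coefficient >= theta."""
    # Left side: intersections of two descriptors from X.
    left = [(f"{a} ∩ {b}", X[a] & X[b]) for a, b in combinations(X, 2)]
    # Right side: intersections (unordered) and differences (ordered) from Y.
    right = [(f"{a} ∩ {b}", Y[a] & Y[b]) for a, b in combinations(Y, 2)]
    right += [(f"{a} − {b}", Y[a] - Y[b]) for a, b in permutations(Y, 2)]
    found = []
    for e_name, e_set in left:
        for f_name, f_set in right:
            if e_set and f_set:  # skip empty expressions
                j = jaccard(e_set, f_set)
                if j >= theta:
                    found.append((e_name, f_name, j))
    return found

X = {"defense": {"U.S.A.", "U.K.", "Japan", "France", "Germany"},
     "nuclear": {"U.S.A.", "U.K.", "Russia", "France", "China",
                 "India", "Israel", "Pakistan"}}
Y = {"permanent": {"U.S.A.", "U.K.", "Russia", "France", "China"},
     "communism": {"Russia", "China"}}  # assumed contents

result = mine_brute_force(X, Y, theta=0.5)
print(result)
```

The number of candidate pairs grows quadratically in the size of each descriptor set even under this narrow bias, which is why an exhaustive search quickly becomes unattractive.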

Why is this problem relevant? We posit that today's high-throughput data-driven sciences are drowning in not 
just the dimensionality of data, but also in the multitude of descriptors available for characterizing data. Consider 
gene expression studies using bioinformatics approaches. The universal set of genes in a given organism (O) can be 
studied in many ways, such as functional categorizations, expression level quantification using microarrays, protein
interactions, and biological pathway involvement. Each of these methodologies provides a different vocabulary 
to define subsets of O (e.g., 'genes localized in cellular compartment nucleus,' 'genes up-expressed two-fold or 
more in heat stress,' 'genes encoding for proteins that form the Immunoglobin complex,' and 'genes involved in 
glucose biosynthesis'). While traditionally we would custom-build data mining algorithms to work with each of 
these vocabularies, redescription mining provides a uniform way to characterize and analyze the results from any 
of them. In addition, it helps bridge diverse experimental methodologies by uniformly relating subsets across the 
corresponding vocabularies. 

We further argue that redescription mining serves as a fundamental building block of many important steps in 
the iterative, often unarticulated, knowledge discovery process. A shift of vocabulary allows a given subset of data 
to be interpretable in a different context, and allows us to harness existing knowledge from this other context. For 
instance, if we are able to redescribe results from a new stress experiment onto, say, a heat shock experiment studied 
earlier, we will be able to study the new results in terms of known biological knowledge about heat shock. Chains of 
redescriptions allow us to relate diverse vocabularies, through important intermediaries. 

Even redescriptions that hold with Jaccard's coefficient < 1 find application in many domains. An approximate
redescription implies a common meeting ground for two concerted communities of objects. A chain of such approximate
redescriptions can effectively relate two subsets that have nothing in common! This is especially useful in storytelling
and link analysis applications. A query such as 'what is the relationship between people traveling on Flight
847 and the top 10 wanted list by the FBI?' can be posed in terms of redescription finding.

While related problems have been studied in the data mining community (most notably, conceptual clustering [6, 16],
niche finding, and profiling classes [22]), we believe that the above formulation of redescription mining has not
been attempted before. Our contributions here are both the introduction of this new data mining problem, as well as 
a novel tree-based algorithm for mining redescriptions. 

2 Redescription Mining as Alternating Tree Induction 

We now introduce an approach (CARTwheels) to mining redescriptions that involves growing two trees in opposite 
directions, so that they are matched at their leaves. The decision conditions in the first tree (say, top) are based on set 
membership checks in entries from X and the bottom tree is based on membership checks in entries from Y; thus 
matching of leaves corresponds to a potential redescription. This idea hence uses paths in the classification trees as 
representations of boolean expressions involving the descriptors. 

The CARTwheels algorithm is an alternating algorithm, in that the top tree is initially fixed and the bottom tree 
is grown to match it. Next, the bottom tree is fixed, and the top tree is re-grown. This process continues, spouting 
redescriptions along the way, until designated stopping criteria are met. 

2.1 Working Example 

For ease of illustration, consider the artificial example in Fig. 1 that shows two sets of descriptors for the universal set
O = {o1, o2, o3, o4, o5}. Here, the set X corresponds to the set of descriptors {X1, X2, X3, X4} and Y corresponds
to {Yi,Y2,Y3, Y4}. The cardinalities of X and Y may not be the same in the general case. Further, in a realistic 
application, the number of descriptors would far exceed the number of objects. 

To initialize the CARTwheels alternation, we prepare a traditional dataset for classification tree induction, where 
the entries correspond to the objects, the boolean features are derived from one of X or Y, and the classes are derived 
from the other. In the dataset shown in Fig. 2 (left), the features correspond to set membership in entries of Y and
each object is assigned a unique class, chosen from the X_i's it participates in. We employed a greedy set covering of
the objects using the entries of X in order to establish the class labels in Fig. 2 (left). For instance, o2 belongs to both
X1 and X3, but the tie is broken in favor of X1. Notice that in this process, X3 does not receive any representation in
the prepared dataset.
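
The labeling step can be sketched as a standard greedy set cover. This is an illustration under stated assumptions rather than the exact procedure used in our implementation: the function name `greedy_cover_labels` is hypothetical, and ties are broken in favor of the descriptor listed first.

```python
def greedy_cover_labels(universe, descriptors):
    """Assign each object one class label by greedy set covering.

    descriptors: dict mapping a descriptor name to its set of objects.
    Ties go to the descriptor that appears first in the dict (an assumption).
    """
    labels = {}
    uncovered = set(universe)
    while uncovered:
        # Pick the descriptor that covers the most still-unlabeled objects.
        best = max(descriptors, key=lambda name: len(descriptors[name] & uncovered))
        newly = descriptors[best] & uncovered
        if not newly:
            raise ValueError("descriptors do not cover the universe")
        for obj in newly:
            labels[obj] = best
        uncovered -= newly
    return labels

O = {"o1", "o2", "o3", "o4", "o5"}
X = {"X1": {"o2", "o3"}, "X2": {"o3", "o4"},
     "X3": {"o2", "o4"}, "X4": {"o1", "o5"}}

labels = greedy_cover_labels(O, X)
print(sorted(labels.items()))
```

On the example data this selects X1, then X4, then X2, so that o2 is labeled X1 and X3 is never used as a class label.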

A classification tree can now be grown using any of the impurity measures studied in the literature (e.g., entropy, 
Gini index, misclassification rate). Fig. 2 (right) depicts a possible tree. The leaves of the tree deterministically
predict a class label from X, typically the majority class. At this point, the specific details of how the tree was 
induced are not important, only that any such tree will induce a partition of the underlying objects. In this case, the 
tree induces a 3-partition which mirrors the 3-class partition present in the original dataset, but is not exactly the 
same. The leftmost path corresponds to the region Y3 ∩ Y2, the rightmost path corresponds to O − Y3 − Y1, and the
union of the two middle paths gives (Y3 − Y2) ∪ (Y1 − Y3). The reader can verify that these regions do not have a
one-to-one correspondence with the regions X1, X2, and X4 in the original partition. For instance, only X2 enjoys
such a correspondence, with O − Y3 − Y1. In 'reading off' a partition from a tree in this manner, a conjunction thus
results from a path of length > 1, a disjunction results from multiple paths predicting the same class, with negations
corresponding to following the 'no' branch from a given node. This partition is used as the starting point for the
alternation (Fig. 4, first frame).
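
Reading a partition off a tree can be sketched as a simple traversal. The tree encoding below (nested tuples of the form `(descriptor, yes_subtree, no_subtree)` with class labels at the leaves), the function name `read_partition`, and the particular tree shape are assumptions chosen to be consistent with the regions described above; Y1 is taken to be {o1, o2} for this illustration.

```python
def read_partition(tree, features, universe):
    """Return {class label: (set expression, set of objects)} for a tree."""
    regions = {}

    def walk(node, objs, terms):
        if isinstance(node, str):  # leaf: the predicted class label
            exprs, members = regions.setdefault(node, ([], set()))
            exprs.append(" ∩ ".join(terms) if terms else "O")
            members |= objs
            return
        name, yes, no = node
        walk(yes, objs & features[name], terms + [name])        # 'yes' branch
        walk(no, objs - features[name], terms + ["¬" + name])   # 'no' branch

    walk(tree, set(universe), [])
    # Paths predicting the same class are union-ed into one expression.
    return {label: (" ∪ ".join(f"({e})" for e in exprs), members)
            for label, (exprs, members) in regions.items()}

O = {"o1", "o2", "o3", "o4", "o5"}
Y = {"Y1": {"o1", "o2"},            # assumed contents
     "Y2": {"o2", "o3", "o4"},
     "Y3": {"o3", "o5"},
     "Y4": {"o1", "o2", "o5"}}
# Assumed tree: root tests Y3; its children test Y2 and Y1.
tree = ("Y3", ("Y2", "X1", "X4"), ("Y1", "X4", "X2"))

partition = read_partition(tree, Y, O)
for label, (expr, members) in partition.items():
    print(label, expr, sorted(members))
```

Each path contributes a conjunction, a 'no' branch contributes a negation, and the two middle paths, which both predict X4, are merged into a single disjunction.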

X1 = { o2, o3 }
X2 = { o3, o4 }
X3 = { o2, o4 }
X4 = { o1, o5 }

Y1 = { o1, o2, }
Y2 = { o2, o3, o4 }
Y3 = { o3, o5 }
Y4 = { o1, o2, o5 }

Figure 1: Example data for illustrating operation of CARTwheels algorithm.

Figure 2: (left) Dataset to initialize CARTwheels algorithm, (right) induced classification tree.

Figure 3: (left) Dataset for second iteration of CARTwheels algorithm. Notice that class labels are now set-theoretic
expressions involving Y_i's. (right) Dataset for third iteration of CARTwheels algorithm.

Figure 4: Alternating tree growing in the CARTwheels algorithm. The alternation begins with a tree (first frame)
defining set-theoretic expressions to be matched. The bottom tree is then grown to match the top tree (second
frame), which is then fixed, and the top tree is re-grown (third frame). Colored arrows indicate the matching paths.
Redescriptions corresponding to matching paths at every stage are read off and subjected to evaluation by Jaccard's
coefficient.

We now prepare a dataset with entries from X as the features and the regions thus formed (involving Y_i's) as
the classes, as shown in Fig. 3 (left). Inducing a classification tree from this dataset really corresponds to growing a
second tree to match the first tree at the leaves, as depicted in Fig. 4 (second frame). In this case, the second tree also
learns a 3-partition and we can evaluate each of these matchings using the Jaccard's measure. This produces three
redescriptions:

(X3 ∩ X1) ∪ (X4 − X3) ⇔ (Y3 − Y2) ∪ (Y1 − Y3)
(O − X3 − X4) ⇔ (Y3 ∩ Y2)
(X3 − X1) ⇔ (O − Y3 − Y1)

all of which hold at Jaccard's coefficient 1. This need not be the case in general. The bottom tree might be able
to match only some paths in the top tree, or the matches might not pass our Jaccard's cutoff. This process is then
continued, now with Y_i's as features and the partitions derived from the bottom tree as classes (see right of Fig. 3).
The new matchings yield the redescriptions:

(X3 ∩ X1) ∪ (X4 − X3) ⇔ Y4
(O − X3 − X4) ⇔ (Y3 − Y4)
(X3 − X1) ⇔ (O − Y3 − Y4)

which, fortuitously, also have a Jaccard's coefficient of 1. Notice that, this time, the root decision node that has been
picked is Y4 (see third frame of Fig. 4) and the tree actually resembles a decision list (a tree where every internal node
has a leaf on its 'yes' branch). The alternation can be continued (see next section for ways to configure the search).

If we limit the size of the trees at every iteration, it is easy to see that the set-expressions constructed cannot get 
arbitrarily long. In our running example, we use a depth limit of 2 so that all expressions on either side of a mined 
redescription can involve at most three descriptors. The longest expressions result from unions of two paths involving 
different subtrees. 

2.2 The CARTwheels Algorithmic Framework 

Why does CARTwheels work? The use of trees to mine one-directional implications (rules) is well understood and 
is the idea behind algorithms such as C4.5 fTOll. In CARTwheels, we exploit the duality between class partitions and
path partitions to posit the stronger notion of equivalence. In fact, if a tree reduces the entropy to zero, it is clear 
that there must be a one-to-one correspondence between its path partitions and class partitions, which are really path 
partitions from the other tree. Keep in mind that different paths are union-ed when they predict the same class, and 
this property is crucial to establishing the duality. 

The search for redescriptions in CARTwheels can be viewed as a problem of identifying (and creating) correlated 
random variables. A descriptor, e.g., D, can be considered to be a discrete random variable that takes on values
from O. Every object in D occurs with probability 1/|D| and other objects occur with probability zero, to yield total
probability mass of 1. Notice that this makes the self entropy of such a random variable the logarithm of the size
of the descriptor. Now consider running a CARTwheels alternation with a depth limit of 1 for the classification trees.
Mining a redescription with Jaccard's coefficient of 1 means that we have identified a random variable D' whose
entropy distance from D is zero. The entropy distance fisll is given by:

$$H(D, D') - I(D; D')$$

where H(D, D') is the joint entropy of D and D', and I denotes the mutual information, in turn given by:

$$I(D; D') = H(D) - H(D \mid D')$$

where H(D) is the self-entropy of D and H(D | D') is the conditional entropy of D given D'. In other words, the
average reduction in uncertainty about D due to knowing D' is exactly the self entropy of D, causing an entropy
distance of 0. Entropy distance is a true distance measure, unlike measures such as the Kullback-Leibler (KL)
divergence. Smaller values of entropy distance hence imply higher values of Jaccard's coefficient. 
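
The two formulas can be evaluated numerically once a joint distribution is fixed. The sketch below makes a simplifying assumption that differs from the construction above: each descriptor is represented by its membership indicator over objects drawn uniformly from O, which gives a well-defined joint distribution. The function names are ours. Under this modeling, a descriptor and its complement are also at distance zero, since each indicator determines the other.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def entropy_distance(d1, d2, universe):
    """H(D, D') - I(D; D') for the membership indicators of d1 and d2."""
    objs = sorted(universe)
    a = [o in d1 for o in objs]
    b = [o in d2 for o in objs]
    h_joint = entropy(list(zip(a, b)))
    # I(D; D') = H(D) - H(D | D') = H(D) + H(D') - H(D, D')
    mutual = entropy(a) + entropy(b) - h_joint
    return h_joint - mutual

O = {"o1", "o2", "o3", "o4", "o5"}
same = entropy_distance({"o1", "o2", "o5"}, {"o1", "o2", "o5"}, O)
different = entropy_distance({"o1", "o2", "o5"}, {"o1", "o2"}, O)
print(same, different)
```

Identical descriptors give a distance of zero, while descriptors that disagree on even one object give a strictly positive value.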

When we increase the depth limit, the analysis gets complicated because the redescription mining problem as 
stated is severely underconstrained. Any successful algorithm must both arrive at the set-theoretic expressions
and test them for equivalence. The growing of classification trees with boolean features multiplexes the steps of
constructive induction and guarantees implication, while the duality of partitions helps set the stage for equivalence 
testing. Ideally we would like to exercise precise control over the sequence in which the algorithm explores options, 
in order to qualify the nature of the mined redescriptions. 

2.2.1 Modeling CARTwheels Alternation 

Towards this end, the operation of CARTwheels can be modeled as a Markov process since the choice of next state 
is a function of only the current state (and perhaps global information such as O, X, and Y). This means that we can
reduce the search for potential redescriptions to the design of a suitable state exploration policy. 

What does 'state' mean in modeling the operation of CARTwheels? Refer back to the duality of path partitions 
and class partitions - either of them could be used as a representation of state. The representation can be given as a 
label vector employing some canonical ordering of the objects in O (to ensure uniqueness). In addition, we have the 
option of including the descriptors for these partitions (in terms of X_i's or Y_j's) as part of the state representation.
This is important when the same partition is realized by different descriptors, but must be considered distinct for
redescription mining purposes (a simple example arises when X or Y have elements that are exactly the same). We
employ this approach in our studies. 

2.2.2 Designing an Exploration Policy 

Once the representational issue is decided, the more fundamental question pertains to the design of a suitable
exploration policy. In contrast to traditional classification tree induction, which is motivated by reducing entropy,
CARTwheels must actually maintain entropy in some form, since impurity drives exploration.

Should CARTwheels attempt to find all redescriptions? This is clearly a tall order, and is reminiscent of the 
difficulties encountered in association rule mining Q, where the number of rules generated can quickly become 
unwieldy. Attempting to do this in the CARTwheels framework is unappealing since every set expression postulated 
by one tree must be matched by every expression modelable in the other tree! This will require multiple visitations 
of the same state and, while we can interleave the testings for matches to a certain extent, would involve enormous 
overhead in book-keeping. Instead we can exploit the algebraic structure of the problem to identify a minimal 
generating set of redescriptions, and design the policy to only visit the relevant states for this purpose. This is similar 
to the strategy pursued by Zaki for mining a non-redundant set of association rules [24]. How this can be done
effectively for redescriptions is the topic of a future paper. 

A second approach forsakes the desire to explore all redescriptions, and instead exploits the property that a 
redescription can be viewed as a subset of O x O space, i.e., a binary relation on O. Here, instead of computing all 
possible equivalences, we only find enough redescriptions to cover this space a specified number of times. We have 
to be careful here because redescriptions occur in two flavors. A redescription with Jaccard's coefficient 1 is a strong 
one, and has a complementary redescription — with both left side and right side expressions negated — that will also 
be strong. A redescription with Jaccard's coefficient < 1 is approximate and might hold in only one complement. For 
instance, if both sides of the redescription cover, say, 90% of the objects in O, then a very high Jaccard's coefficient 
can result purely by chance! Needless to say, the complementary redescription involving the remaining 10% of the 
objects may not hold. The net effect is that some redescriptions might imply a complete cover of O x O space, 
whereas others will only cover subsets. A workable criterion of coverage hence requires careful study. 

In this paper, we employ a simpler exploration policy where descriptors participating in a path (but not the leaf)
and yielding a good redescription are removed from consideration in subsequent alternations. In experimental tests,
we have observed that this greedy policy rapidly exhausts the sets X and Y. We also place a limit on
the number of alternations that the algorithm can pass through without finding any redescriptions. Once this limit is
exceeded (which happens after many useful entries from X and Y have been deleted), the algorithm terminates.
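
The control flow of this policy can be sketched as follows. The tree-growing step is abstracted behind a hypothetical callable `alternate`, and the stub `toy_alternate` merely pairs identical descriptors so that the loop can be exercised; neither is part of our implementation, and the sketch removes every descriptor a step reports as used without distinguishing path nodes from leaves.

```python
def explore(X, Y, alternate, max_stalls):
    """Alternate until more than `max_stalls` consecutive steps mine nothing.

    alternate(X, Y) is a hypothetical stand-in for one tree-growing step; it
    returns a list of (redescription, names_of_descriptors_used) pairs.
    """
    X, Y = dict(X), dict(Y)  # work on copies of the descriptor sets
    mined, stalls = [], 0
    while stalls <= max_stalls and X and Y:
        results = alternate(X, Y)
        if not results:
            stalls += 1
            continue
        stalls = 0
        for redescription, used in results:
            mined.append(redescription)
            for name in used:  # greedy removal of descriptors already used
                X.pop(name, None)
                Y.pop(name, None)
    return mined

def toy_alternate(X, Y):
    """Stub for illustration only: report the first pair of identical descriptors."""
    for x_name, x_set in X.items():
        for y_name, y_set in Y.items():
            if x_set == y_set:
                return [((x_name, y_name), [x_name, y_name])]
    return []

mined = explore({"A": {1, 2}, "B": {3}}, {"P": {1, 2}, "Q": {4}},
                toy_alternate, max_stalls=2)
print(mined)
```

Removing used descriptors shrinks X and Y monotonically, so the loop ends either when one of the sets is empty or when the stall limit is exceeded.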

CARTwheels also employs randomization heuristics to facilitate state exploration. For instance, when assigning 
class labels after inducing a tree, we take care to ensure that the same label is not assigned to all leaves, and suitably 
randomize assignments toward this purpose. This might appear counter-intuitive, but notice that it only has the effect 
of re-organizing the partitions derived from the paths, and can be seen as buying time for the Markov chain. Another 
example pertains to how decision nodes are selected for inclusion in a tree. While we use entropy as the primary driver 
for tree induction, we sometimes perform randomized moves at the root level, in order to prevent over-dominance of
one descriptor in the ensuing redescriptions. 

2.3 Implementation Details 

CARTwheels is implemented in C++ atop a Postgres database providing access to the descriptors. We use an AD-tree
data structure [TT^ for fast counting purposes and estimation of entropy (this is distinct from the classification tree that
combines the descriptors). The AD-tree provides access to the distributions of 'class labels' for every combination of 
'features' and, since the definition of features and class labels change at every iteration, is rebuilt continually. Notice 
that the data structure is expected to provide both the sizes of descriptors as well as their negations (when we follow 
the 'no' branch) and hence, the depth of the AD-tree is set to just greater than the allowable depth of the classification 
trees. The CARTwheels algorithm consults the AD-tree whenever it must make a choice of a decision node (except 
when its move is exploratory). After evaluating matchings, set-expressions read off the trees are subjected to tabular
minimization, in order to arrive at a canonical form.

The implementation allows for configuring the space of redescriptions that are explored. The depth limit for the 
top and bottom trees can be individually specified, and we can also preferentially include or exclude certain types of 
expressions in mined redescriptions. For instance, syntactic constraints on redescriptions (e.g., only conjunctions are 
allowed) can be incorporated as biases in the tree construction phase of CARTwheels. 

3 Applications in Bioinformatics 

We now present an application of CARTwheels to studying gene expression datasets from microarray experiments
conducted on the budding yeast Saccharomyces cerevisiae. Bioinformatics is fertile ground for application of
CARTwheels and S. cerevisiae is arguably the most well studied (and documented) model organism through
bioinformatics techniques. Practically every experimental methodology applied towards yeast can be viewed as a way to
define descriptors. Even the results of other data analysis/mining algorithms can be used as a source of descriptors! 
The underlying universal set of objects could be initialized to the set of genes, proteins, or processes, in S. cerevisiae. 
CARTwheels hence brings many computational and experimental technologies to bear upon redescription mining. It 
supports the capture of both similarities and distinctions among descriptors derived from these diverse sources. 

The redescription process begins by defining a universal set of genes O, which is dependent on our biological
goals. Here, we are interested in characterizing similarities and differences in yeast gene expression behavior across
related families of stresses. Gasch et al. [9] is an important source for such a study since it provides results from
more than 170 comparisons, across a variety of environmental stresses. We selected five stresses from this dataset 
(heat shock from 25°C to 37°C, hyper-osmotic shock, hypo-osmotic shock, H2O2 exposure, and mild heat shock at 
variable osmolarity) and initialized O to be the set of genes that show significant expression (more than 1-fold up- or 
down-expression) in some time point in each of these stresses. This results in a set of 74 genes/ORFs. 



The choice of the universal set can be viewed as a conditioning context and must be kept in mind when interpreting 
any mined redescriptions. It can be viewed as an implicit descriptor occurring on both sides of every redescription, 
e.g., E ⇔ F in O can be viewed as E ∩ O ⇔ F ∩ O.

We defined 824 descriptors, in a variety of ways. One class of descriptors was derived from categories in the 
GO biological process, GO cellular component, and GO molecular function taxonomies, that have representation 
among the chosen 74 genes. This yields a total of 378 descriptors (210 GO BIO + 42 GO CELL + 126 GO MOL). 
The microarray results from the five stresses of Gasch et al. [9] were bucketed to yield range descriptors of the form
'expression level ∈ [%x, 0] in time point %y of stress experiment %z' (for negative %x) and 'expression level ∈ [0,
%x] in time point %y of stress experiment %z' (for positive %x). This produces 224 descriptors. Further, k-means
clustering was performed using the Genesis software suite fJTl on each of the stresses individually, with a setting of 10 
clusters. Since heat shock and mild heat shock at variable osmolarity are actually pairs of experiments, this step yields 
(5 + 2) × 10 = 70 descriptors depicting clusters of genes with similar time profiles. Finally, we included microarray
results from a histone depletion experiment conducted by Wyrick et al. [23] and created 152 range descriptors similar
to the Gasch stresses; this is to allow us to relate the effect of histone depletion to that of environmental stresses. 

To invoke CARTwheels, we initialized X to be all descriptors derived from the Gasch et al. dataset (which 
includes the range descriptors as well as the k-means clusters). This ensures that all redescriptions will involve some 
aspect of the Gasch et al. experiment and prevents the possibility of, say, mining a redescription between two GO 
taxonomies. Y was initialized to the set of all descriptors; thus, there is some overlap between X and Y. In order 
to prevent obvious redescriptions arising from this overlap, the algorithm was precluded from utilizing descriptors in 
one tree if they are already present in the other tree. 

We employed a Jaccard's threshold of 0.5 and a depth-limit of 2 in both the top and bottom tree induction 
alternations. The limit on the number of allowable alternations till a redescription is mined is set to 10. Redescriptions 
inferred from CARTwheels are required to hold in both the mined and complementary forms. For example, for the 
equivalence E1 ∪ E2 ⇔ F to be considered as a redescription, it must hold with Jaccard's coefficient at least 0.5,
as must its complement: ¬E1 ∩ ¬E2 ⇔ ¬F. This ensures that every redescription truly induces a partition of O.
Thus vetted, redescriptions are then subjected to a 'tightening' step, akin to rule pruning in packages like C4.5.
This might involve attempting to drop terms from both sides of the redescription, or restricting range descriptors (if
they occur in the redescription), and determining whether this causes significant degradation of Jaccard's coefficient.
If no degradation is observed, then the redescription can be tightened. With these design choices, and the greedy 
exploration policy, CARTwheels terminates after using 150 of the 824 descriptors, yielding about 200 redescriptions. 
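
The requirement that a redescription hold in both its mined and complementary forms can be sketched as below. The function name `holds_both_ways` and the ten-object universe are illustrative assumptions; the threshold of 0.5 is the one used in our experiments.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def holds_both_ways(e, f, universe, theta=0.5):
    """True if E ⇔ F and ¬E ⇔ ¬F both reach Jaccard's coefficient theta."""
    return (jaccard(e, f) >= theta
            and jaccard(universe - e, universe - f) >= theta)

O = set(range(10))
e = set(range(0, 9))    # covers 90% of O
f = set(range(1, 10))   # covers a different 90% of O

forward = jaccard(e, f)              # 8 shared objects out of 10
backward = jaccard(O - e, O - f)     # the complements {9} and {0} are disjoint
accepted = holds_both_ways(e, f, O)
print(forward, backward, accepted)
```

Two descriptors that each cover most of O overlap heavily by chance, so the forward form passes the threshold easily while the complementary form fails; such a candidate is rejected.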

Seven key mined redescriptions (R1-R7) are depicted in Fig. 5. They were selected for both their biological
interest as well as for their feature construction novelties. R1 is a redescription between the GO taxonomy and
experimental stresses from the Gasch dataset, and involves two genes. The rectangular region on the right side is
bounded by the extremal values specific to the experiment, and hence is captured by a conjunction of merely two
descriptors. From a biological perspective, R1 is interesting because it relates contrasting behavior in two different
experimental comparisons (positive in heat shock 10 minutes, and negative in hypo-osmotic shock) to a GO biological
category related to stress.

R2 is actually a chain of two redescriptions, mined in successive iterations of CARTwheels. This redescription
involves 11 genes and relates the disjunction of two different GO biological categories to expression data across three
different stresses (this time, all involving positive expression). Notice that one of the derived expression descriptors
is also a disjunction. It is pertinent to note that R2 experienced some tightening of its range descriptors; this is why
one of its expressions has one more term than would be expected for a 2-level tree.

R3 satisfies our curiosity about the similarity between the histone depletion experiment and a Gasch comparison 
(heat shock). It involves 7 genes, two of which are hypothetical and one with a putative annotation. Such
redescriptions involving un-annotated genes are important for suggesting testable hypotheses about their functionality.

Figure 5: Seven redescriptions mined from gene expression studies on Saccharomyces cerevisiae. Each box gives
a readable statement of the redescription, presents it in graphical form, and identifies the genes conforming to the
redescription. The Jaccard's coefficient is displayed over the redescription arrow. Notice that some redescriptions
(e.g., R1, R5) involve only two genes, whereas others such as R2 involve larger numbers.

The remaining redescriptions involve cluster profiles on one or both sides. R4 relates a k-means cluster to a
set difference of two related GO cellular component categories. Interestingly, two of the genes participating in this
redescription (YDR342C and YHR096C) are singled out by the next redescription (R5), which identifies a different
k-means cluster to characterize these genes, and which also uses a set-difference, this time of related GO molecular
function categories.

R6 is another chain of redescriptions, similar to R2, and relates a particular trend in the Gasch dataset to positively 
expressed genes in three different time points of the histone experiment. It involves 7 genes. Finally, R7 is actually a 
triangle of redescription relationships that illustrates the power of CARTwheels. Three different experimental com- 
parisons are involved in this circular chain of redescriptions, with 10 genes being implicated in all three descriptors. 
From a biological standpoint, this is a very interesting result - the common genes indicate concerted participation 
across stress conditions, whereas the genes participating in, say, two of the descriptors, but not the third, suggest a 
careful diversification of functionality. 

4 Discussion 

This paper is a first exploration into the formulation of the redescription mining problem and has presented an ap- 
proach for mining redescriptions automatically. Redescriptions can be thought of as generalizations of one-directional 
implications (e.g., association rules, rules in ILP ifTSl ), where one descriptor is required to be a proper subset of the 
other. This generalization coupled with the automatic identification of set-theoretic constructions makes CARTwheels 
a very powerful approach to mining (approximate) equivalence relations. We have demonstrated the effectiveness of 
CARTwheels in a domain that exhibits a richness of descriptors, and shown how it captures patterns involving small 
as well as large sets of objects. 
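
To make the contrast with one-directional implications concrete, the following minimal Python sketch treats descriptors as plain sets of object identifiers. The object labels and descriptor names are made up for illustration and are not data from the yeast study; `jaccard` is a helper introduced here, not part of CARTwheels itself.

```python
def jaccard(a, b):
    """Jaccard's coefficient |a & b| / |a | b|; taken to be 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical descriptors over made-up object labels (assumption, not paper data).
category_a = {"g1", "g2", "g3", "g4", "g5"}
category_b = {"g4", "g5"}
expression_high = {"g1", "g2", "g3", "g6"}

# A set-theoretic construction on one side: a set difference of two descriptors.
left = category_a - category_b

# One-directional implication: every object matching `left` also matches the other side.
is_implication = left <= expression_high

# Redescription quality: how close the two sides come to describing the same set.
score = jaccard(left, expression_high)

print(is_implication, score)
```

Here the implication holds exactly, yet the two sides are only an approximate redescription of each other, because `expression_high` contains one object that `left` does not.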

The work presented here can be considered a significant extension of ideas pursued in the schema matching [20], 
clustering categorical data [7], and model management [2] literature. The relationships considered in schema match- 
ing research are primarily of the foreign key nature or otherwise operate at the instance level, whereas we consider 
more complex set-theoretic relationships. Clustering categorical data focuses on defining similarity measures in non- 
metric spaces and this research can be fruitfully integrated with our work. However, notice that we are not merely 
clustering data but also imposing describability constraints. Model management is a framework that recognizes the 
complex inter-relationships that would exist in multi-database enterprises and provides union, intersection, and differ- 
ence operators for reconciliation, integration, and migration purposes. The relationships here are assumed to be user 
provided, and the emphasis is on actually 'executing a redescription.' CARTwheels can thus be usefully employed 
here as a driver for determining what these relationships should be. 

We now outline some directions for future research. The connection between Jaccard's coefficient and algorithmic 
driver parameters (such as entropy) deserves further study. Other ways of evaluating redescriptions lUTI IT3l are also 
pertinent here (e.g., Dice coefficient) and some of these could support more efficient tree-based algorithms than the 
Jaccard's coefficient. Ideally, an evaluation metric would obey some closure properties in the space of redescriptions, 
which can be used to configure an exploration strategy. In addition, it is preferable that an evaluation metric lends 
itself to the design of a statistical test of significance for redescriptions. 
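
As a small illustration of how two such evaluation measures relate, the sketch below computes Jaccard's coefficient and the Dice coefficient for the same pair of sets. The sets are arbitrary integers chosen for the example, and the helper names are assumptions of this sketch.

```python
def jaccard(a, b):
    """|a & b| / |a | b|; taken to be 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dice(a, b):
    """2 |a & b| / (|a| + |b|); taken to be 1.0 when both sets are empty."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0

# Arbitrary example sets (assumption): they share two of eight distinct elements.
x = {1, 2, 3, 4}
y = {3, 4, 5, 6, 7, 8}

j = jaccard(x, y)
d = dice(x, y)

# For any pair of sets, the two measures are tied by D = 2J / (1 + J).
d_from_j = 2 * j / (1 + j)

print(j, d, d_from_j)
```

Because the Dice coefficient is a monotone function of Jaccard's coefficient, the two rank candidate redescriptions identically; they differ in the numeric thresholds one would set and in the algebraic form an algorithm might exploit.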

Thus far, we have assumed a 'flat' organization of the given descriptors and do not recognize any structural 
relationships between them. However, some descriptor vocabularies (e.g., derived from GO) enjoy a hierarchical 
structure, which can be exploited by the mining algorithm. Specialized redescription algorithms can thus be designed 
for targeted descriptor families. 

There is an intrinsic limit to a dataset's potential to reveal redescriptions, which can be studied through statistical 
analysis of set size distributions and estimates of overlap potential. Of particular interest here is qualifying the 
'expected' results from a CARTwheels alternation before actually performing the alternation; the entropy rate of the 
stochastic process underlying the Markov chain [5] can be a useful indicator in this regard. 
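
For reference, the entropy rate of a stationary Markov chain with transition matrix P and stationary distribution pi is H = -sum_i pi_i sum_j P_ij log2 P_ij. The sketch below evaluates this formula in plain Python. The two-state transition matrix is a hypothetical example, not one derived from a CARTwheels alternation, and the power iteration assumes the chain is irreducible and aperiodic.

```python
import math

def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution by repeatedly applying pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def entropy_rate(P):
    """Entropy rate in bits per step: -sum_i pi_i sum_j P_ij log2 P_ij."""
    pi = stationary_distribution(P)
    return -sum(
        pi[i] * p * math.log2(p)
        for i, row in enumerate(P)
        for p in row
        if p > 0  # terms with zero probability contribute nothing
    )

# Hypothetical transition matrix (assumption): each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = stationary_distribution(P)
rate = entropy_rate(P)
print(pi, rate)
```

A low entropy rate indicates that the chain's next state is largely predictable from its current state, while a high rate indicates that many transitions remain roughly equally likely.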

Our current focus is on using redescriptions to automatically span multiple levels of abstraction (e.g., gene subsets 
→ pathways → biological processes). This would firmly establish the importance of redescription in bridging the 
diverse levels at which information is created and characterized. 